The dataset that we are using for this project is sampled from a large dataset of Women's Shoes from Datafiniti database. It contains around 10,000 online women shoes listings from April 2015 to April 2019 (9,709 after data cleaning). The dataset has 32 columns and includes information such as name, brand, minimum and maximum prices, size, color, url links and additionals for each offered pair of women shoes. For the purposes of our analysis, we decided to focus mainly on some columns including date, price, brands, colors, name and size.
By analyzing this dataset, we would like to identify what trends the shoes manufacturers follow nowadays and what kind of pricing strategies shoe brands have now. Our expectation is to analyze the data in different dimensions and give conclusions and recommendations on shoe trends by answering these questions:
In this project, we use the same dataset as the BA 770 one, and we have known that there are outliers. To do a more systematic analysis, we start with data cleaning and organizing.
install.packages("tidyverse")
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.6/tidyverse_1.2.1.tgz'
Content type 'application/x-gzip' length 89217 bytes (87 KB)
==================================================
downloaded 87 KB
The downloaded binary packages are in
/var/folders/x5/gtbnd5cj5rl0c9h7jtktvqn80000gn/T//RtmpUNMowA/downloaded_packages
install.packages("plotly")
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.6/plotly_4.9.0.tgz'
Content type 'application/x-gzip' length 2801589 bytes (2.7 MB)
==================================================
downloaded 2.7 MB
The downloaded binary packages are in
/var/folders/x5/gtbnd5cj5rl0c9h7jtktvqn80000gn/T//RtmpUNMowA/downloaded_packages
install.packages("ggplotly")
Warning in install.packages :
package ‘ggplotly’ is not available (for R version 3.6.1)
install.packages("ggthemes")
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.6/ggthemes_4.2.0.tgz'
Content type 'application/x-gzip' length 420298 bytes (410 KB)
==================================================
downloaded 410 KB
The downloaded binary packages are in
/var/folders/x5/gtbnd5cj5rl0c9h7jtktvqn80000gn/T//RtmpUNMowA/downloaded_packages
install.packages("gridExtra")
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.6/gridExtra_2.3.tgz'
Content type 'application/x-gzip' length 1103470 bytes (1.1 MB)
==================================================
downloaded 1.1 MB
The downloaded binary packages are in
/var/folders/x5/gtbnd5cj5rl0c9h7jtktvqn80000gn/T//RtmpUNMowA/downloaded_packages
install.packages("ggpubr")
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.6/ggpubr_0.2.3.tgz'
Content type 'application/x-gzip' length 1819911 bytes (1.7 MB)
==================================================
downloaded 1.7 MB
The downloaded binary packages are in
/var/folders/x5/gtbnd5cj5rl0c9h7jtktvqn80000gn/T//RtmpUNMowA/downloaded_packages
# commands above can be skipped if these have already been installed on current working environment
library(readr)
library(tidyverse)
[30m── [1mAttaching packages[22m ────────────────────────────────────────────── tidyverse 1.2.1 ──[39m
[30m[32m✔[30m [34mggplot2[30m 3.2.1 [32m✔[30m [34mpurrr [30m 0.3.2
[32m✔[30m [34mtibble [30m 2.1.3 [32m✔[30m [34mdplyr [30m 0.8.3
[32m✔[30m [34mtidyr [30m 1.0.0 [32m✔[30m [34mstringr[30m 1.4.0
[32m✔[30m [34mggplot2[30m 3.2.1 [32m✔[30m [34mforcats[30m 0.4.0[39m
[30m── [1mConflicts[22m ───────────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[30m [34mdplyr[30m::[32mfilter()[30m masks [34mstats[30m::filter()
[31m✖[30m [34mdplyr[30m::[32mlag()[30m masks [34mstats[30m::lag()[39m
library(ggplot2)
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
library(dplyr)
library(lubridate)
Attaching package: ‘lubridate’
The following object is masked from ‘package:base’:
date
library(tidyr)
library(stringr)
library(ggthemes)
library(gridExtra)
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine
library(ggpubr)
Loading required package: magrittr
Attaching package: ‘magrittr’
The following object is masked from ‘package:purrr’:
set_names
The following object is masked from ‘package:tidyr’:
extract
women_shoes <- Final_version_Final_version_3_
options(stringsAsFactors = FALSE)
Check for the unique number of brands (before cleaning):
length(unique(women_shoes$brand))
[1] 494
brand and drop all NAswomen_shoes <- drop_na(women_shoes)
women_shoes$brand <- tolower(women_shoes$brand)
women_shoes$brand <- gsub("-"," ",women_shoes$brand)
for (i in 1: nrow(women_shoes)) {
if (women_shoes$brand[i] == "aeorosoles") {
women_shoes$brand[i] <- "aerosoles"
}
if (women_shoes$brand[i] == "bamboo Brand" ) {
women_shoes$brand[i] <- "bamboo"
}
if (women_shoes$brand[i] == "baretraps") {
women_shoes$brand[i] <- "bare traps"
}
if (women_shoes$brand[i] == "beacon" ) {
women_shoes$brand[i] <- "beacon shoes"
}
if (women_shoes$brand[i] == "benjamin walk" ) {
women_shoes$brand[i] <- "benjamin walk dyeables"
}
if (women_shoes$brand[i] == "carlos by carlo santana" ) {
women_shoes$brand[i] <- "carlos"
} else if (women_shoes$brand[i] == "carlos by carlos santana") {
women_shoes$brand[i] <- "carlos"
}
if (women_shoes$brand[i] == "cat footwear") {
women_shoes$brand[i] <- "cat"
}
if (women_shoes$brand[i] == "charles by charles david") {
women_shoes$brand[i] <- "charles david"
}
if (women_shoes$brand[i] == "cityclassified" ) {
women_shoes$brand[i] <- "city classified"
}
if (women_shoes$brand[i] == "diba" ) {
women_shoes$brand[i] <- "diba true"
}
if (women_shoes$brand[i] == "dr. scholl's shoes" ) {
women_shoes$brand[i] <- "dr. scholl's"
}
if (women_shoes$brand[i] == "drew shoe" ) {
women_shoes$brand[i] <- "drew"
}
if (women_shoes$brand[i] == "ellie shoes") {
women_shoes$brand[i] <- "ellie"
}
if (women_shoes$brand[i] == "fergie footwear" ) {
women_shoes$brand[i] <- "fergie"
} else if (women_shoes$brand[i] == "fergie shoes") {
women_shoes$brand[i] <- "fergie"
}
if (women_shoes$brand[i] == "genuine grip footwear") {
women_shoes$brand[i] <- "genuine grip"
}
if (women_shoes$brand[i] == "j.rene") {
women_shoes$brand[i] <- "j. renee"
}
if (women_shoes$brand[i] == "jbu by jambu" ) {
women_shoes$brand[i] <- "jbu"
}
if (women_shoes$brand[i] == "koolaburra by ugg" ) {
women_shoes$brand[i] <- "koolaburra"
}
if (women_shoes$brand[i] == "l'artiste by spring step" ) {
women_shoes$brand[i] <- "lartiste"
} else if (women_shoes$brand[i] == "lartiste by spring step") {
women_shoes$brand[i] <- "lartiste"
}
if (women_shoes$brand[i] == "lauren by ralph lauren" ) {
women_shoes$brand[i] <- "ralph lauren"
} else if (women_shoes$brand[i] == "lauren ralph lauren" ) {
women_shoes$brand[i] <- "ralph lauren"
}
if (women_shoes$brand[i] == "life stride" ) {
women_shoes$brand[i] <- "lifestride"
} else if (women_shoes$brand[i] == "lifestride shoes" ) {
women_shoes$brand[i] <- "lifestride"
}
if (women_shoes$brand[i] == "michael michael kors" ) {
women_shoes$brand[i] <- "michael kors"
}
if (women_shoes$brand[i] == "pleaserusa" ) {
women_shoes$brand[i] <- "pleaser usa"
}
if (women_shoes$brand[i] == "skechers work") {
women_shoes$brand[i] <- "skechers"
}
if (women_shoes$brand[i] == "soda shoes" ) {
women_shoes$brand[i] <-"soda"
}
if (women_shoes$brand[i] == "soft style by hush puppies" ) {
women_shoes$brand[i] <-"soft style"
}
if (women_shoes$brand[i] == "softwalk footwear" ) {
women_shoes$brand[i] <-"softwalk"
}
if (women_shoes$brand[i] == "sorel footwear" ) {
women_shoes$brand[i] <-"sorel"
}
if (women_shoes$brand[i] == "sperry top sider" ) {
women_shoes$brand[i] <- "sperry top sider"
} else if (women_shoes$brand[i] == "sperry" ) {
women_shoes$brand[i] <- "sperry top sider"
}
if (women_shoes$brand[i] == "summitfashions" ) {
women_shoes$brand[i] <-"summit fashions"
}
if (women_shoes$brand[i] == "toms shoes" ) {
women_shoes$brand[i] <- "toms"
}
if (women_shoes$brand[i] == "ugg australia" ) {
women_shoes$brand[i] <-"ugg"
}
if (women_shoes$brand[i] == "vionic with orthaheel technology" ) {
women_shoes$brand[i] <- "vionic"
}
}
length(unique(women_shoes$brand))
[1] 390
glimpse(women_shoes)
Observations: 9,709
Variables: 9
$ id [3m[90m<chr>[39m[23m "AWpyySsJAGTnQPR7wNt4", "AWpyyyb3AGTnQPR7wN-u", "AWpyzlajAG…
$ dateAdded [3m[90m<date>[39m[23m 2019-05-01, 2019-05-01, 2019-05-01, 2019-05-01, 2019-05-01…
$ brand [3m[90m<chr>[39m[23m "city classified", "skechers", "floral", "jambu", "trotters…
$ colors [3m[90m<chr>[39m[23m "Black", "Taupe", "Black", "Black/Multi", "White", "Metalli…
$ name [3m[90m<chr>[39m[23m "City Classified Stylish-S Women's Zipper Ankle Booties", "…
$ prices.amountMax [3m[90m<dbl>[39m[23m 29.95, 84.00, 49.95, 127.20, 47.66, 82.64, 79.95, 62.41, 62…
$ prices.amountMin [3m[90m<dbl>[39m[23m 29.95, 84.00, 49.95, 127.20, 47.66, 82.64, 79.95, 62.41, 62…
$ prices.offer [3m[90m<chr>[39m[23m "Online only", "Online only", "Online only", "Online only: …
$ sizes [3m[90m<chr>[39m[23m "8", "6", "5", "9.5", "11", "11", "9", "6", "5", "8", "7", …
women_shoes %>%
group_by(sizes) %>%
mutate(
discount = case_when(
str_detect(prices.offer, pattern = "off") ~ "Yes",
TRUE ~ "No"
)
) %>%
filter(discount == "Yes") %>%
summarize(
count = n()
) %>%
select(sizes, count) %>%
arrange(desc(count)) %>%
filter(sizes != "multiple")-> sale
head(sale)
We can see from chart above that the most popular sizes while on sale are size 8 and 7.
| ## 2. Color ### Popular colors among shoes |
r Idcolors <- unique(women_shoes$colors) Idcolors <- as.data.frame(Idcolors) temp <- separate_rows(Idcolors, Idcolors, sep = ",") summary(temp) |
Idcolors Length:2379 Class :character Mode :character |
r Idcol <- unique(temp) class(Idcol) |
[1] "data.frame" |
r summary(Idcol) |
Idcolors Length:1955 Class :character Mode :character |
r for (i in 1:nrow(Idcol)) { Idcol$count[i] <- sum(str_count(women_shoes$colors, pattern = Idcol$Idcolors[i]), na.rm = TRUE) } Idcol <- arrange(Idcol, desc(count)) Idcolplot <- head(Idcol, 10) %>% filter(Idcolors !="Ta", Idcolors !="Patent" ) pct <- round(Idcolplot$count/sum(Idcolplot$count)*100) lbls <- paste(Idcolplot$Idcolors, pct) lbls <- paste(lbls,"%",sep="") pie(pct,labels = lbls, col=c("#ADDED4","#B5D4E9","#C3CED9","#C9E4B4"), main="Popular colors in general") |
| Black and white take over almost 60% among all colors, and each of the rest has less than 10%. |
| ### Popular colors in each season We make a copy from the original dataset to work on. |
r shoesw <- women_shoes summary(shoesw) |
| ``` id dateAdded brand colors Length:9709 Min. :2015-04-01 Length:9709 Length:9709 Class :character 1st Qu.:2019-04-19 Class :character Class :character Mode :character Median :2019-04-19 Mode :character Mode :character Mean :2018-12-29 3rd Qu.:2019-04-23 Max. :2019-05-01 name prices.amountMax prices.amountMin prices.offer Length:9709 Min. : 5.00 Min. : 5.00 Length:9709 Class :character 1st Qu.: 34.99 1st Qu.: 34.99 Class :character Mode :character Median : 52.07 Median : 52.07 Mode :character Mean : 72.76 Mean : 72.76 3rd Qu.: 85.77 3rd Qu.: 85.77 Max. :5000.00 Max. :5000.00 sizes Length:9709 Class :character Mode :character |
| ``` |
| Then we construct 4 different dataframes for each season. |
r shoesw %>% mutate(season =ifelse(month(dateAdded) %in% c(1,2,12), "Winter", ifelse(month(dateAdded) %in% c(3,4,5), "Spring", ifelse(month(dateAdded) %in% c(6,7,8), "Summer", "Fall")))) -> shoesw |
| Glimpse at data frame with new columm. |
r glimpse(shoesw) |
Observations: 9,709 Variables: 10 $ id [3m[90m<chr>[39m[23m "AWpyySsJAGTnQPR7wNt4", "AWpyyyb3AGTnQPR7wN-u", "AWpyzlajAG… $ dateAdded [3m[90m<date>[39m[23m 2019-05-01, 2019-05-01, 2019-05-01, 2019-05-01, 2019-05-01… $ brand [3m[90m<chr>[39m[23m "city classified", "skechers", "floral", "jambu", "trotters… $ colors [3m[90m<chr>[39m[23m "Black", "Taupe", "Black", "Black/Multi", "White", "Metalli… $ name [3m[90m<chr>[39m[23m "City Classified Stylish-S Women's Zipper Ankle Booties", "… $ prices.amountMax [3m[90m<dbl>[39m[23m 29.95, 84.00, 49.95, 127.20, 47.66, 82.64, 79.95, 62.41, 62… $ prices.amountMin [3m[90m<dbl>[39m[23m 29.95, 84.00, 49.95, 127.20, 47.66, 82.64, 79.95, 62.41, 62… $ prices.offer [3m[90m<chr>[39m[23m "Online only", "Online only", "Online only", "Online only: … $ sizes [3m[90m<chr>[39m[23m "8", "6", "5", "9.5", "11", "11", "9", "6", "5", "8", "7", … $ season [3m[90m<chr>[39m[23m "Spring", "Spring", "Spring", "Spring", "Spring", "Spring",… |
| #### -Winter color: |
r Winter <- shoesw %>% filter(season == "Winter") Wcolors <- unique(Winter$colors) Wcolors <- as.data.frame(Wcolors) temp <- separate_rows(Wcolors, Wcolors, sep = ",") summary(temp) |
Wcolors Length:341 Class :character Mode :character |
r Wcol <- unique(temp) for (i in 1:nrow(Wcol)) { Wcol$count[i] <- sum(str_count(Winter$colors, pattern = Wcol$Wcolors[i])) } Wcol <- arrange(Wcol, desc(count)) Wcolplot <- head(Wcol, 10) ggplot(data = Wcolplot, mapping = aes(x = reorder(Wcolors, -count), y = count, fill= reorder(Wcolors, -count))) + geom_bar(stat = "identity")+ scale_fill_manual(values = c("White" = "White", "Black" = "Black", "Blue" = "#0000FF", "Navy" = "#000080", "Bone" = "#e3dac9", "Turquo" = "#40E0D0", "Grey" = "#808080", "Beige" = "#f5f5dc", "Natural" = "#caa472", "Pewter" = "#8e9294")) + labs(x = "colors", y = "number of listings", title = "Popular colors in Winter") + theme(legend.position="none") |
| The most popular color during Winter is “White” and “Black”. Most of those colors are of colder tones, which is quite understandable considering that winter is a cold time of the year. |
| #### -Fall color: |
r Fall <- shoesw %>% filter(season == "Fall") Fcolors <- unique(Fall$colors) Fcolors <- as.data.frame(Fcolors) temp <- separate_rows(Fcolors, Fcolors, sep = ",") summary(temp) |
Fcolors Length:193 Class :character Mode :character |
r Fcol <- unique(temp) class(Fcol) |
[1] "data.frame" |
r for (i in 1:nrow(Fcol)) { Fcol$count[i] <- sum(str_count(Fall$colors, pattern = Fcol$Fcolors[i])) } Fcol <- arrange(Fcol, desc(count)) Fcolplot <- head(Fcol, 11) %>% filter(Fcolors != "All Black") ggplot(data = Fcolplot, mapping = aes(x = reorder(Fcolors, -count), y = count, fill = reorder(Fcolors, -count))) + geom_bar(stat = "identity")+ scale_fill_manual(values = c("Black" = "Black", "Silver" = "#C0C0C0", "Brown" = "#A52A2A", "White" = "White", "Tan" = "#D2B48C", "Taupe" = "#b38b6d", "Periwinkle" = "#CCCCFF", "Chocolate" = "#D2691E","Cinnamon" = "#d2691e", "Blue" ="#0000FF")) + labs(x = "colors", y = "number of listings", title = "Popular colors in Fall") + theme(legend.position="none") |
| The most popular color for Fall season is “Black”. The color tones look quite suitable for Fall season. |
| #### -Summer color: |
r Summer <- shoesw %>% filter(season == "Summer") Sucolors <- unique(Summer$colors) Sucolors <- as.data.frame(Sucolors) temp <- separate_rows(Sucolors, Sucolors, sep = ",") summary(temp) |
Sucolors Length:23 Class :character Mode :character |
r Sucol <- unique(temp) class(Sucol) |
[1] "data.frame" |
r for (i in 1:nrow(Sucol)) { Sucol$count[i] <- sum(str_count(Summer$colors, pattern = Sucol$Sucolors[i])) } Sucol <- arrange(Sucol, desc(count)) Sucolplot <- head(Sucol, 10) %>% filter(Sucolors != "Black Leopard") ggplot(data = Sucolplot, mapping = aes(x = reorder(Sucolors, -count), y = count, fill = reorder(Sucolors, -count))) + geom_bar(stat = "identity")+ scale_fill_manual(values = c("Black" = "Black","Brown" = "#A52A2A", "Dusty" = "#ac9b9b", "Leopard" = "#a37319", "White" = "White", "Taupe" = "#b38b6d", "Black Suede" = "#3d3e3c", "Beige" = "#f5f5dc", "Dusty Brown Suede" ="#A17F72")) + labs(x = "colors", y = "number of listings", title = "Popular colors in Summer") + theme(legend.position="none") |
| The most popular color for Summer season is also “Black”. Which is a little bit surprising, as we would expect more bright and popping colors being the most popular ones during Summer time. Nevertheless, “Brown”, “Dusty” and “Leopard” colors are equally popular during Summer. Apparently, “Leopard” print is very popular during summer time. |
| #### -Spring color: |
r Spring <- shoesw %>% filter(season == "Spring") Spcolors <- unique(Spring$colors) Spcolors <- as.data.frame(Spcolors) tmp <- separate_rows(Spcolors, Spcolors, sep = c(",")) summary(tmp) |
Spcolors Length:1856 Class :character Mode :character |
r Spcol <- unique(tmp) class(Spcol) |
[1] "data.frame" |
r for (i in 1:nrow(Spcol)) { Spcol$count[i] <- sum(str_count(Spring$colors, pattern = Spcol$Spcolors[i]), na.rm = TRUE) } Spcol <- arrange(Spcol, desc(count)) Spcolplot <- head(Spcol, 10) %>% filter(Spcolors != "Ta", Spcolors != "Patent") ggplot(data = Spcolplot, mapping = aes(x = reorder(Spcolors, -count), y = count, fill = reorder(Spcolors, -count))) + geom_bar(stat = "identity")+ scale_fill_manual(values = c("Black" = "Black","White" = "White", "Grey" = "#808080", "Brown" = "#A52A2A", "Red" = "#FF0000", "Gold" = "#FFD700", "Silver" = "#C0C0C0","Beige" = "#f5f5dc"))+ labs(x = "colors", y = "number of listings", title = "Popular colors in Spring") + theme(legend.position="none") |
| The most popular color for Spring season is also “Black”. Also here we see more |
| Conclusion: Except for Winter, the most popular color is Black, which is quite understandable as black shoes are considered as a common option, because they usually fit well with any other colors in a whole outfit. Winter’s popular color, White, is also understandable as people may tend to match snow color in winter. Moreover, the popular colors in seasons tend to match the tones of the seasons. |
women_shoes$month <- months(as.Date(women_shoes$dateAdded))
women_shoes$year <- year(as.Date(women_shoes$dateAdded))
women_shoes%>%
filter(year == "2015") -> new_date
ggplot(new_date, aes(x = month)) +
geom_bar(stat = "count", na.rm = TRUE, fill = "darkseagreen3")
We limited our dataset to only one year range as our dataset have so many missing data, for example we only have data for April and May in 2019. Based on the result we can see that April and September are the month that most brand launch their new product follow by August, November and December. We can assume that most brand usually release their products for between two seasons.
In Price section, we need to further clean the dataset. We would like to change all the letters in name into lowercase and create a new dataframe called new_df.
women_shoes2 <- transform(women_shoes, lowname = tolower(name))
women_shoes2 %>%
group_by(id) %>%
mutate(
prices = (prices.amountMax + prices.amountMin)/2
) %>%
select(id, dateAdded, brand, lowname, prices, prices.offer) -> new_df
head(new_df)
Then we want to find outliers through a boxplot gragh. Based on the result, we consider price over $500 are outliers.
tmp <- new_df %>%
ggplot(aes(x = 1, y = prices)) +
geom_boxplot(color = "black") +
coord_flip() +
theme_classic()
ggplotly(tmp)
A new boxplot after we remove these outliers (prices > $500):
tmp0 <- new_df %>%
filter(prices < 500) %>%
ggplot(aes(x = 1, y = prices)) +
geom_boxplot(color = "black") +
coord_flip() +
theme_classic()
ggplotly(tmp0)
Next, we exclude outliers and extract the pattern from name. We also create a new column category specifying different categories for each pair of shoes. Finally, we extract “off” from prices.offer and create a new column discount which indicates whether the price is on sale or not.
test1 <- new_df %>%
mutate(
category = case_when(
str_detect(lowname, pattern = "boot") ~ "boots",
str_detect(lowname, pattern = "pump") ~ "pump",
str_detect(lowname, pattern = "sneaker") ~ "sneaker",
str_detect(lowname, pattern = "sandal") ~ "sandal",
str_detect(lowname, pattern = "flat") ~ "flat",
str_detect(lowname, pattern = "slipper") ~ "slipper",
str_detect(lowname, pattern = "loafer") ~ "loafer",
str_detect(lowname, pattern = "clog") ~ "clog"
),
discount = case_when(
str_detect(prices.offer, pattern = "off") ~ "Yes",
TRUE ~ "No"
)
) %>%
filter(prices < 500)
test1$category[is.na(test1$category)] <- "others"
test1 %>%
select(id, dateAdded, brand, lowname, prices, category, discount) %>%
arrange(desc(prices))-> new_df2
head(new_df2)
summary(women_shoes)
id dateAdded brand colors
Length:9709 Min. :2015-04-01 Length:9709 Length:9709
Class :character 1st Qu.:2019-04-19 Class :character Class :character
Mode :character Median :2019-04-19 Mode :character Mode :character
Mean :2018-12-29
3rd Qu.:2019-04-23
Max. :2019-05-01
name prices.amountMax prices.amountMin prices.offer
Length:9709 Min. : 5.00 Min. : 5.00 Length:9709
Class :character 1st Qu.: 34.99 1st Qu.: 34.99 Class :character
Mode :character Median : 52.07 Median : 52.07 Mode :character
Mean : 72.76 Mean : 72.76
3rd Qu.: 85.77 3rd Qu.: 85.77
Max. :5000.00 Max. :5000.00
sizes month year
Length:9709 Length:9709 Min. :2015
Class :character Class :character 1st Qu.:2019
Mode :character Mode :character Median :2019
Mean :2019
3rd Qu.:2019
Max. :2019
glimpse(new_df2)
Observations: 9,691
Variables: 7
Groups: id [8,643]
$ id [3m[90m<chr>[39m[23m "AWpIeaEfM263mwCq78AC", "AWoxI9NHAGTnQPR7qqIT", "AWpx5SOVAGTnQPR7v…
$ dateAdded [3m[90m<date>[39m[23m 2019-04-23, 2019-04-18, 2019-05-01, 2019-04-20, 2019-04-19, 2017-…
$ brand [3m[90m<chr>[39m[23m "ugg", "asics", "easy street", "aerosoles", "lifestride", "easy st…
$ lowname [3m[90m<chr>[39m[23m "ugg bailey bow tall ii women's shoes boots 1016434 black", "asics…
$ prices [3m[90m<dbl>[39m[23m 383.26, 374.17, 346.65, 346.65, 346.65, 346.65, 346.65, 338.95, 31…
$ category [3m[90m<chr>[39m[23m "boots", "others", "boots", "boots", "flat", "sandal", "sandal", "…
$ discount [3m[90m<chr>[39m[23m "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", …
new_df2 %>%
group_by(brand) %>%
summarize(price_range=(max(as.numeric(prices))-min(as.numeric(prices))),
count = n()) %>%
select(brand, count, price_range) %>%
arrange(desc(price_range)) -> df1
head(df1,10)
Last time we have some brand that have price range for around $5000, by working more on this dataset, we find out that there are some product sell for $5000. We did some data cleaning on this dataset to get rid of those outliers. After done with the data cleaning, we got our result for price distribution of each brand. Here are top 10 brand that have the widest price distribution range. We can see that brands usually have price range smaller than $350.
We observe that boots habe the highest count and the average price is $67. Sandals have the second highest count and the average price is $57. More average prices are showned as below:
new_df2 %>%
group_by(category) %>%
summarize(
avg_price_by_category = mean(prices, na.rm = TRUE),
count= n()
) %>%
select(category, count, avg_price_by_category) %>%
arrange(avg_price_by_category)-> avg_price_category
head(avg_price_category, 9)
We make a summary of prices in different categories, and surprisely see that the average price in all categories are below $100.
tmp2 <- new_df2 %>%
filter(category != "others") %>%
ggplot(aes(x = category, y = prices)) +
stat_summary(fun.y = "mean", fun.ymin = min, fun.ymax = max, colour = "#7C7F91") +
geom_hline(yintercept = 100, linetype = "dashed", color = "red")+
xlab("Category of Shoes") +
ylab("Prices") +
labs(title = "Summary of Prices in Different Categories of Shoes") +
theme(plot.title = element_text(hjust = 0))+
theme_classic()
ggplotly(tmp2)
The summary above shows the average price of most of categories are way lower than the median. So we decided to deep into the average price which are under $200 in order to see what the data can tell us about. As the boxplot shows that the the most price distribution are below $100, we decided to choose the sample of price less than 100 to do the further analysis.
new_df2 %>%
filter(category != "others", prices < 200) %>%
ggplot(aes(x = category, y = prices))+
geom_boxplot(color = "#7C7F91")+
theme_classic()+
labs(title = "Price Among Categories") ->price_category
ggplotly(price_category)
Here is the comparision among different categories, avg_price and count. We was planning to choose the top two of highest count categories to see the price trend over the past 4 years. But we found that there are missing data for both sandal and flat before 2016. Therefore, we chose to compare the price trend for both boots and pump.
tmp1 <- avg_price_category %>%
filter(category != "others") %>%
ggplot(aes(x = category, y = avg_price_by_category, size = count)) +
geom_point(color = "darkseagreen3") +
theme_bw() +
xlab("Category of Shoes") +
ylab("Average Price in Each Category") +
labs(title = "The Count of Average Price in Different Categories") +
theme(plot.title = element_text(hjust = 0))
ggplotly(tmp1)
In order to see the price trend over the past 4 years, we format the date coloum to year and month, and also we computed the average price based on category and seperately filter both boots and pump category as our oberservations. In the end, we plot two lines for both category in one gragh in order to compare the price trend from time to time. As a result, we can see that the average price for pump shoes is steadily increase, and the average price for boots had a significant increase from December, 2015 to October, 2016 and them drop significantly after that until July, 2017. Overall, we can see the average price for boots is more fluctuating while pumps shoes is more stable.
format(new_df2$dateAdded, "%Y-%m") -> new_df2$dateAdded
new_df2 %>%
group_by(dateAdded, category) %>%
summarize(
avgprice_by_ym = mean(prices)
) %>%
arrange(dateAdded) -> avgprice_by_ym
avgprice_by_ym %>%
filter(category =="boots") ->boots
avgprice_by_ym %>%
filter(category =="pump") ->pump
p = ggplot() +
geom_line(data = boots, aes(x = dateAdded, y = avgprice_by_ym, color = category, group = 1))+
geom_line(data =pump, aes(x = dateAdded, y = avgprice_by_ym, color = category, group = 1))+
theme_classic()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
labs(title = "Trend of Avg.Price in Both Boots and Pump")+
xlab("Year-Month")+
ylab("Average Price")
ggplotly(p)
First, we create a copy of dataframe from the previous part.
shoes_price <- new_df2
glimpse(shoes_price)
Observations: 9,691
Variables: 7
Groups: id [8,643]
$ id [3m[90m<chr>[39m[23m "AWpIeaEfM263mwCq78AC", "AWoxI9NHAGTnQPR7qqIT", "AWpx5SOVAGTnQPR7v…
$ dateAdded [3m[90m<chr>[39m[23m "2019-04", "2019-04", "2019-05", "2019-04", "2019-04", "2017-01", …
$ brand [3m[90m<chr>[39m[23m "ugg", "asics", "easy street", "aerosoles", "lifestride", "easy st…
$ lowname [3m[90m<chr>[39m[23m "ugg bailey bow tall ii women's shoes boots 1016434 black", "asics…
$ prices [3m[90m<dbl>[39m[23m 383.26, 374.17, 346.65, 346.65, 346.65, 346.65, 346.65, 338.95, 31…
$ category [3m[90m<chr>[39m[23m "boots", "others", "boots", "boots", "flat", "sandal", "sandal", "…
$ discount [3m[90m<chr>[39m[23m "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", …
To analyze the price trend among distinct brands, we create a list of popular brands by the number of shoe pairs in this dataset and the average price of each brand, both in descending order. Then we list the top 10 brands (see table below).
top_brand <- shoes_price %>%
group_by(brand) %>%
summarize(count = n(),
brand_AVG = mean(prices, na.rm = TRUE)) %>%
arrange(desc(count)) %>%
head(10)
top_tbl <- top_brand
colnames(top_tbl) <- c("Brand", "Count", "Brand Average")
png("top4.png")
top4 <- tableGrob(top_tbl, rows = rownames(NA))
grid.arrange(top4)
The highest count of the pair number is 1002, which falls in Brinley Co.. There are two brands with extremely high brand averages (above $100), SAS and L’artiste. The top 5 brands all have count above 400 pairs, but their brand averages vary alot. The next step is to compare the variance in prices for these brands.
We make a box plot on brand (listed in alphabetical order of brand names), and the coral horizontal line is the mean price of all shoes in this dataset.
shoes_price %>%
semi_join(top_brand, by = "brand") -> popBrands
ggplot(popBrands) +
geom_boxplot(aes(x = brand, y = prices, col = brand)) +
scale_color_manual(values = c("#7C7F91", "#ADDED4", "#B5D4E9", "#C3CED9",
"#C9E4B4", "#CDBCB6", "#DBBDE5", "#F0C3A3",
"#FBCDBE", "#FDDDAA")) +
labs(title = "Price Distribution by Brands",
x = "Brands",
y = "Price",
color = "Brand") +
geom_hline(yintercept = median(popBrands$prices, na.rm = TRUE), color = "indianred3", size = 0.5, alpha = 0.7) +
geom_text(x = 9, y = 330, label = "mean = 71.39", col = "ivory4") +
theme_classic() +
theme(axis.text.x = element_text(angle = 15))
The box structure in box plots represents where the middle 50% of the data points are, so we can cross-compare the distributions of top 10 brands by comparing the positions of boxes around the dataset average line. The majorities of SAS and L’artiste have prices larger than the dataset average, while these two brands also have the highest mean prices. Similarly, most shoes in Brinley Co. and Soda have small prices than the dataset average.
Some of the box plots illustrate the effect of the outliers. Keep focusing on L’artiste, we find that the middle 50% percentile, median, and mean prices are all much greater than the dataset mean, and this observation can be attributed to the effect of overall higher prices and outliers in L’artiste. In contrast, Areosoles, another brand with unignorable number of outliers, holds a mean price slightly higher than the dataset mean. Without looking at the box, we can infer that the majority of prices are smaller than the dataset mean, and the box plot confirms the implication (the median of Aerosoles is below the line).
To get a snippet of the popular price range among popular shoe brands, we choose the top 3 brands (Brinley Co., SAS, and Trotters) and create a combined density plot (as an alternative of histograms) with vertical lines representing the brand means.
top3 <- top_brand[1:3, ]
shoes_price %>%
semi_join(top3, by = "brand") -> sampleBrand
ggdensity(sampleBrand, x = "prices", y = "..count..", rug = TRUE,
add = "mean",
add.params = list(color = c("#7C7F91", "#CDBCB6", "#FDDDAA"), linetype = "longdash", size = 1),
color = "brand", fill = "brand",
palette = c("#ADDED4", "#B5D4E9", "#DBBDE5"),
xlab = "Price", ylab = "Number of shoes",
title = "Price distribution of sample brands",
legend = "right")
First, we would like to inspect the trends individually. The distributions of Brinley Co. and SAS are right-skewed, which is the effect out high value shoes. Trotters displays a relatively “binomial” distribution, because the prices tend to diverge to the edges instead of merging at the center. Then we observe two overlapping areas: one is [40, 60] and the other is [110, 160].
We can define these two ranges as the “popular” ones based on the sample properties. Top 3 brands cover nearly 20% ((1002 + 455 + 451)/9709) of the shoes. Such portion of the data size can somehow reflect the retailer’s favor. Another interesting pattern in this sample is the relative location of price statistics (mean). Brinley Co. and SAS have extreme mean values (as we have mentioned in the box plot). Trotters has a mean of $83, pretty close to the overall mean. This pattern makes the 3-observation sample quite representative; along with their popularities, we can conclude that the overlapped ranges are reliable to suggest the popular ranges.
At the end of the price inspection, we plot a density graph for all shoe prices regardless of their categories and brands:
dist_all <- ggdensity(shoes_price, x = "prices", y = "..count..",rug = TRUE,
fill = "#ADDED4", color = "#ADDED4", alpha = 0.7,
add = "mean",
add.params = list(color = "#7C7F91", linetype = "longdash", size = 1),
xlab = "Average Price", ylab = "Number of shoes",
main = "Price distribution of shoes") +
coord_cartesian(xlim = c(0, 300)) +
theme_classic()
Using both `xintercept` and `mapping` may not have the desired result as mapping is overwritten if `xintercept` is specified
ggplotly(dist_all) %>%
layout(title = list(text = paste0("Price distribution of shoes",
"<br>",
"</sup>")))
The overall trend is right-skewed, even after we have removed outliers ($500 and above). As we have observed in the category box plot, most prices fall below $100. So if we set $100 as an upper bound to our data pool, the overall distribution would look more “normal”, which corresponds the statistical lemma that a larger sample size (compared to single brand trend) tends to display a more “Normal” distribution. Moreover, there are two density peaks, [45, 60] and [120, 150], which also overlaps with the popular range of Top 3 brands. Hence we can consider these ranges as “popular” to retailers.
We have analyzed a sample of ~9,700 of online women’s shoes listings and came to conclusion that most brands tend to offer more low prices shoes rather than high prices ones. Although some brands offer expensive shoes, they also have cheap options. This observation indicates that brands tend to target low to medium level income audience by offering “affordable prices”.
We have also identified the price distributions in category and brand. The general distribution pattern is right-skewed. In terms of popular prices, brands usually keep their price interval less than $350. We have also observed that the larger sample size tends to employ a more “Normal” distribution.
In addition, we have noticed some categorical trends. Black, white, shades and beige shoes keep earning favors in different seasons. Size 7 and 8 have a larger potential customer group. Most manufactures tend to launch new products in April and September. All these trends could reflect shoe industry’s preference from 2015 to 2019, and fashion analysts could predict future trends with more research.
Based on the summary, we would be able to provide some recommendations to those who are referring to Women's Shoes dataset. When purchasing products, retailers could focus more on black and white shoes, or other color tones in response to season change. They may also need to identify target groups of customers first to set price for shoes. Two frequent ranges, $45 - $60 and $120 - $150, can be references on customer’s preference based on different types of shoes.